Abstract

In this paper we study the problem of estimating inner-cyclic time
intervals within repetitive motion sequences of top-class swimmers in a
swimming channel. Interval limits are given by temporal occurrences of
key-poses, i.e. distinctive postures of the body. A key-pose is defined
by means of only one or two specific features of the complete posture.
It is often difficult to detect such subtle features directly. We
therefore propose the following method: Given that we observe the
swimmer from the side, we build a pictorial structure of poselets to
robustly identify random support poses within the regular motion of a
swimmer. We formulate a maximum likelihood model which predicts a
key-pose given the occurrences of multiple support poses within one
stroke. The maximum likelihood can be extended with prior knowledge
about the temporal location of a key-pose in order to improve the
prediction recall. We experimentally show that our models reliably and
robustly detect key-poses with a high precision and that their
performance can be improved by extending the framework with additional
camera views.


1. Introduction 

In this work we describe an application for top-class competitive
sports, taking into account recent developments in pose and motion
estimation as well as time series analysis. Consider the following
real-life application of a camera system for evaluating the technique of
swimmers: In the field of competitive swimming, a quantitative
evaluation is highly desirable to supplement the typical qualitative
analysis. For this purpose, an athlete swims in a swimming channel, a
small pool where the water can be accelerated to flow constantly from
one end to the other. The pool has at least one glass wall and is
monitored with multiple cameras from different angles. The athlete then
performs regular swimming motions in one of the four swimming styles,
namely freestyle, breaststroke, backstroke or butterfly, while being
filmed by one or more cameras. Figure 1(a) depicts an example snapshot
from one camera for this setup. The video footage is evaluated
afterwards by an expert who, for instance, annotates certain poses,
single joints and other variables of interest. Desired kinematic
parameters are inferred from this manual evaluation in a last step.
However, quantitative (manual) evaluations are very time consuming and
therefore only used in very few individual cases. The proposed solution
focuses on determining all data necessary to automatically derive
desired kinematic parameters of a top-level swimmer in a swimming
channel, that is, we would like to retrieve the stroke frequency and
inner-cyclic intervals. Deriving kinematic parameters from these
intervals and recommending actions for improving the technique is not
part of our approach; this is the job of a professional coach and falls
into the field of training sciences. Hence, we look at this problem
solely from a computer vision standpoint. The problem of determining all
aforementioned parameters can be reduced to the following: Given a
stream of image frames, we would like to identify poses of special
interest, which we label as key-poses. In general, a key-pose is defined
by a human expert based on arbitrary features of the pose. A feature
could for instance be the position or angle of the upper arm in the
image. Detecting such pose features directly in a video stream is quite
challenging due to heavy noise and (self-)occlusion. We therefore assume
that the frames of interest, i.e. the ones showing a key-pose, are
“hidden” as we cannot detect them directly. A cyclic motion, however,
has a predetermined structure. Hence, we interpolate the occurrence of a
key-pose based on points in the cycle that can be detected reliably.

Therefore, we propose the following method: A database of
joint-annotated images of swimmers is temporally clustered and a
detector is trained for each cluster. The term temporally refers to a
central property of each cluster, which should only contain image
patches from a part of poses that appear closely within a small window
of time within a swimming cycle. Multiple detectors are joined into a
star-shaped pictorial structure, which outputs a frame descriptor for
each image based on spatially restricted max pooling of each part. We
aggregate all frame descriptors to form a set of time series in order to
decide whether a detector is activated. Based on these activations, the
occurrence of a key-pose is estimated by averaging over “good” detector
signals. Finally, we show that our approach is not limited to the
analysis of athletes and apply it to a human gait dataset.

Figure 1. (a) A swimmer in a swimming channel (left). Arm configuration
patches form clusters for poselets (right). (b) Above view on the
swimming channel. (c) Example from the CMU Motion of Body database with
poselet cluster examples.

Preliminary Definitions. Before we dive into the model formulation, we
would like to define some commonly used terms in this work. The smallest
unit within a repetitive motion is one cycle or stroke, defined by the
time that passes between the appearance of a specific body pose and its
earliest reappearance. Freestyle swimming (like walking or running) is
an anti-symmetrical cyclic motion: Every pose of the left body half
occurs approximately half a cycle later on the other side of the body.
This observation is important as a detector based on gradients is not
able to reliably distinguish anti-symmetrical poses given that we
observe the person from the side.

2. Related Work 

Part based models have played a huge role in the fields of object
detection and (human) pose estimation within the last years. Based on
the fundamental work of Fischler and Elschlager [9], these models
represent an object through multiple parts which are connected via
deformation terms, often visualized as springs, allowing for matching
them in a flexible configuration. Various manifestations of this basic
notion have been developed through the years, kicked off by Felzenszwalb
et al. [8] with their deformable part models for object detection.
Different refinements have been proposed specifically for human pose
estimation, e.g. by enriching a model with additional parts to
compensate for the flexibility of the human body [22] or by allowing
rotation of parts [1].

While effective implementations of part detectors have been proposed for
characteristic body parts like head and torso, part templates for
extremities are usually weak. This issue has been addressed by [15], who
argue that person and body part templates should be pose specific rather
than generally trained. Bourdev et al. [3] also follow this notion by
proposing the concept of poselets, generic rigid part detectors based on
Histograms of Oriented Gradients (HoG) [6] as a generalization of
specific body part detectors. Poselets lift the spatial limitation of
parts being connected to an actual body part and encode generic parts of
the body. Gkioxari et al. [10] recently utilized poselets for training
discriminative classifiers to specifically differentiate between arm
configurations of a person.

In the context of key-frame selection in videos, poselets have been used
for human activity recognition. [16] proposed a framework based on
poselet activations for selecting key-frames that represent key-states
in an action sequence. An additional classifier trained on pairwise
terms for all activations then decides if a specific action sequence
occurred. Carson et al. [4] select action specific postures by matching
shape information from individual frames in order to recognize specific
tennis strokes in game footage.

The analysis of human gait probably plays the biggest role in the field
of periodic motion research. A big focus lies on the identification of a
person via his/her intrinsic gait signature, for example by determining
and tracking the body shape [24] or through fusion of multiple gait
cycles [13]. More general approaches strive to recover the human body
pose [14] in order to retrieve a full set of gait parameters. Periodic
motion in images was examined by Cutler & Davis [5], who use self
similarity and frequency transformations to obtain the frequency of the
periodic motion of human and animal gait.

Most work researching the tracking of people in aquatic environments has
focused on drowning detection [7], localization of athletes in swimming
competitions [19] and motion analysis for video based swimming style
recognition [21]. A Kalman filter framework is presented in [11] to
explicitly model the kinematics of cyclic motions of humans in order to
estimate the joint trajectories of backstroke swimmers. Ries et al. [17]
use Gaussian features for detecting a specific pose of a swimmer in a
pool with the intention of initializing his/her pose. The method closest
to our approach is presented in [23], who divide swimming cycles into
intervals and train object detectors for each interval. The stroke rate
is computed by counting the occurrences of the intervals. However, they
show that arbitrary poses cannot be detected with their approach.

3. Method 

For deducing desired key-poses from a predictable or repetitive motion,
we build a two-stage system: Firstly, a pictorial structure of poselets
is trained in order to extract a descriptor for each frame in the video.
Secondly, we aggregate all frame descriptors to time series and define a
maximum likelihood estimator for good poselet signals in order to
predict the occurrence of a key-pose.

3.1. Poselet Training 


We build our system on localizable parts of the human pose, initially
introduced as poselets by Bourdev et al. [3]. The original work defines
poselets as rigid linear filters based on Histograms of Oriented
Gradients (HoG, [6]) features. Each filter is trained from a set of
example image patches that are close in configuration space. The patches
are transformed into feature space and a linear SVM is trained for them.
For evaluation, the resulting dense linear filter is cross-correlated
with a feature grid/pyramid, yielding a score for every placement of the
filter.

Our approach depends heavily on precisely trained, distinguishable
poselets. We achieve this by extracting patches of characteristic parts
of the motion, e.g. patches of the limbs, from all images. The
underlying groups of joints, called configurations, are clustered by a
k-means algorithm. From each resulting group of patches a linear SVM is
trained.

A simplifying assumption often made in the context of poselets is that
the representation of the part does not change with part rotation. While
other approaches [10] strive to find the best possible transformation
between different configurations by rotating, reflecting, translating
and rescaling body configurations, we would like to extend the notion of
a poselet as a representation of a part of the pose and additionally a
small time window within a repetitive motion. Hence, we develop the
following distance function for the clustering algorithm which assures
that the configurations are not rotated.

Dataset. Our dataset for training the poselets is built on video footage
of freestyle swimmers. We annotated 1200 images with complete
configurations, i.e. a total of 13 joints per athlete. In case of
(self-)occlusion, we averaged between joint location estimates of
different annotators. The images cover different athletes, 3 male and 5
female, performing overall 20 strokes. We tried to cover the most
obvious variables that influence the configuration space and image
quality, e.g. different genders, body heights, physiques, illumination
and water flow velocities (between $1\,\mathrm{m\,s^{-1}}$ and
$1.75\,\mathrm{m\,s^{-1}}$). All swimmers are filmed by one camera
through the side wall of the swimming channel, depicting their left body
side. The camera films with a resolution of 720 x 576 pixels at 50i.
From these images, groups of “sub-configurations” are extracted, e.g.
arm configurations (shoulder, elbow and wrist of the same arm) or leg
configurations (hips and knees). The clustering algorithm then groups
these sets of joints and each cluster forms the foundation for training
a linear SVM (poselet).

Temporal Poselet Clustering. Let $A = (a_1, \dots, a_n)^T \in
\mathbb{R}^{n \times 2}$ and $B = (b_1, \dots, b_n)^T \in \mathbb{R}^{n
\times 2}$ be two configurations of n joints, where each $a_i, b_i \in
\mathbb{R}^2$ ($i = 1, \dots, n$) denotes a 2-dimensional location
(x, y) of one joint. We wish to find a transformation that moves B to A
so that the squared Euclidean norm is minimized, i.e.

$d(A, B) = \min_{s, c} \sum_{i=1}^{n} \| a_i - s b_i + c \|_2^2$.    (1)
Thus the configuration B is translated via c and resized with a scaling
factor s. This formulation closely resembles the Procrustes optimization
problem [18], with the important difference that we do not allow for B
to be rotated and reflected by a linear transformation. This assures
that a configuration from any point within the cyclic motion is not
reflected or rotated onto a configuration that is not temporally close.
One solution for equation 1 is given by


$d(A, B) = \frac{\sum_{i=1}^{n} \left\| \tilde{a}_i - \max\left(0,
\mathrm{tr}[\tilde{A}\tilde{B}^T] / \mathrm{tr}[\tilde{B}\tilde{B}^T]
\right) \tilde{b}_i \right\|_2^2}{\mathrm{tr}[\tilde{A}\tilde{A}^T]}$    (2)

where $\tilde{A}$, $\tilde{B}$ and $\tilde{a}_i$, $\tilde{b}_i$ are mean
corrected matrices and vectors respectively. The max operator in
equation 2 constrains s to be greater than or equal to zero. A negative
scaling factor is equivalent to a point reflection at the origin of the
coordinate system. As we stated that any kind of reflection is unwanted
behaviour, we force the distance function to the closest optimal
solution w.r.t. the constraint. The denominator of equation 2 was
suggested by Sibson [20] and standardizes the distances between
different pairs of configurations with the intention of making them
comparable.
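
The distance described above can be sketched in a few lines. This is a minimal illustration, assuming configurations are stored as n x 2 NumPy arrays; the function name `config_distance` and the handling of degenerate inputs are hypothetical choices and not taken from the paper.

```python
import numpy as np

def config_distance(A, B):
    """Scale- and translation-invariant distance between two joint
    configurations (n x 2 arrays). Rotation and reflection are not allowed."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Mean-correct both configurations (this removes the translation c).
    A0 = A - A.mean(axis=0)
    B0 = B - B.mean(axis=0)
    norm_a = np.sum(A0 * A0)  # tr(A0 A0^T)
    norm_b = np.sum(B0 * B0)  # tr(B0 B0^T)
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: configurations whose joints all coincide are rejected.
        raise ValueError("degenerate configuration (all joints coincide)")
    # Least-squares scale, clamped at zero so B is never point-reflected.
    s = max(0.0, np.sum(A0 * B0) / norm_b)
    residual = A0 - s * B0
    # Normalization by tr(A0 A0^T) makes distances comparable across pairs.
    return np.sum(residual ** 2) / norm_a
```

With this sketch, a configuration that is only scaled and translated has a distance of (numerically) zero to the original, while a point-reflected copy is clamped to s = 0 and receives the maximal normalized distance of 1.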

A k-means clustering is applied using the distance function in equation
2, yielding groups of image patches that occur temporally close within a
cycle. Hence, each cluster represents a part of an athlete’s body within
a small time frame. All image patches in a cluster are transformed into
dense HoG grids and a linear SVM is trained on them, yielding one
poselet per cluster.

3.2. Frame Descriptor 

Recently poselets have been used in human detection and 
pose estimation [16]. 

Given a set of poselets trained from different temporally close
configurations, we would like to pick up the notion of a poselet
activation vector [2]. Instead of simply maximizing over the
cross-correlations of a poselet at all positions and scales of a feature
pyramid, we add a spatial bias to our model: A poselet is only evaluated
in a region relative to the location of an athlete, which is determined
by an additional detector trained for the complete configuration of a
swimmer. This guarantees that we do not search for a part of the human
body if there is no person present in the image or close to its
position. We take advantage of the temporal component of our system in
order to decide if a poselet is activated by observing the score of a
detector over time and deciding that it is activated if its score is
locally maximal.

Mixture of poselets. Let $P_i = (F_i, w_i, h_i)$ ($i = 1, \dots, n$) be
a triple describing one of n poselets $F_i$ with size $w_i \times h_i$.
The score of a poselet at position p in a feature pyramid is computed
via the cross-correlation

$\mathrm{score}(F_i, p) = F_i * \Phi(p)$    (3)

of $F_i$ with the underlying subwindow $\Phi(p)$ in the pyramid. A
position $p = (x, y, s, w, h)^T$ is defined by two coordinates x and y,
a scale s of a pyramid level and the size of the subwindow $w \times h$,
which equals the size of the poselet.

Multiple poselets are combined into a mixture $M = (P_0, P_1, \dots,
P_n)$. Similar to [6], we train a poselet $F_0$ for the whole
configuration of an athlete and use it to retrieve an initial hypothesis
$p_0$ for the placement of the athlete, where

$p_0 = \arg\max_{p} \mathrm{score}(F_0, p)$.    (4)

This best detection $p_0$ is projected to the original image size
through $\hat{p}_0 = (x s_0^{-1}, y s_0^{-1}, 1, w_0 s_0^{-1}, h_0
s_0^{-1})^T = (x_0, y_0, 1, w_0, h_0)^T$. With this root hypothesis, we
retrieve a score for each poselet $P_i$ in the mixture by maximizing
over

$s_{F_i} = \max_{p_k \in R} \mathrm{score}(F_i, p_k)$,    (5)

where

$R = \{ p_k \mid z^T \Sigma_i^{-1} z < \gamma \}$    (6)

with $z = (x_k, y_k)^T - s_k ((x_0, y_0)^T + \mu_i)$. The set R
restricts the position of a poselet by means of the Mahalanobis
distance. All position elements in R lie within an elliptic region
defined by the covariance matrix $\Sigma_i$. It is centered around the
position of the root model $F_0$ plus an offset $\mu_i \in \mathbb{R}^2$
of the poselet relative to the root. The size of this region is
restricted by $\gamma$ and empirically set to $\gamma = 3$. Both $\mu_i$
and $\Sigma_i$ can be estimated directly from the training data by
fitting a normal distribution on pairwise offsets between the root model
and a poselet. The final output of our model is a frame descriptor $s_f$
of spatially limited poselet scores, where

$s_f = (s_{F_1}, \dots, s_{F_n})^T$    (7)

for frame f. Note that the scores of each poselet are not thresholded in
any way to determine if it is activated. Instead, we determine if a
poselet is active by examining the poselet scores over time in the next
section.
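
The spatially restricted max pooling can be illustrated as follows. This is a sketch under several assumptions: each poselet's filter responses on one pyramid level are already available as a 2D array, the root position is given in original image coordinates, and the elliptic region is tested by comparing the quadratic form $z^T \Sigma^{-1} z$ against $\gamma$. The function names and the single-level simplification are hypothetical.

```python
import numpy as np

def restricted_max_score(score_map, root_xy, mu, sigma, gamma=3.0, scale=1.0):
    """Maximum poselet score inside an elliptic region around the expected
    poselet location (root position plus mean offset mu)."""
    h, w = score_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Expected poselet position, mapped to this pyramid level via `scale`.
    center = scale * (np.asarray(root_xy, float) + np.asarray(mu, float))
    z = np.stack([xs - center[0], ys - center[1]], axis=-1)
    # Quadratic form z^T Sigma^-1 z for every placement.
    d2 = np.einsum("...i,ij,...j->...", z, np.linalg.inv(sigma), z)
    mask = d2 < gamma  # assumption: threshold applies to the quadratic form
    if not mask.any():
        return -np.inf
    return score_map[mask].max()

def frame_descriptor(score_maps, root_xy, mus, sigmas, gamma=3.0):
    """Stack the restricted maxima of all poselets into one vector."""
    return np.array([
        restricted_max_score(m, root_xy, mu, sig, gamma)
        for m, mu, sig in zip(score_maps, mus, sigmas)
    ])
```

A strong response far away from the expected location is ignored by construction, which is the purpose of the spatial bias: only placements that are plausible relative to the root detection contribute to the descriptor.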

The model formulation above closely resembles a star-shaped pictorial
structure of poselets, where all parts are connected via deformation
features $\mu$ and $\Sigma$ with a root model. A popular related
approach, initially developed by Felzenszwalb et al. [8] and expanded by
many others [22, 15, 1], is called a deformable part model for object
detection. Similar to these models, we can solve the problem described
in equations 4 and 5 efficiently using dynamic programming.

Note that the root model $P_0$ (and also all part models) do not
necessarily have to be poselets; equivalently, they may be replaced by
more sophisticated or complex models without loss of generality of this
approach.

3.3. Key-pose estimation 

In order to estimate the (regular) occurrence of a key-pose in a video,
we firstly post-process aggregated frame descriptors and define a
measure of goodness for a time series of max-pooled poselet scores based
on self similarity of the series. Secondly, we describe a maximum
likelihood estimator for predicting a key-pose.

The frame descriptors from section 3.2 are aggregated in a matrix
$S = (s_1, \dots, s_T)$ for the T frames of a video. As we trained the
mixture model for temporally distinctive detectors, each poselet
intuitively acts like a sensor measuring the presence or absence of a
body part. If the body part that the poselet was trained for is present
in a frame at the location specified by the deformation variables, the
score of the poselet should be high. If the athlete continues his
movement, the position or representation of a body part changes and the
poselet score should decrease. The underlying assumption here is that
the poselet “works”, i.e. that the observed score really reflects the
image content. This is of course not the case for all poselets: While
some configurations are simply not suited to be represented and reliably
detected by a dense linear filter based on HoG features, other
configurations are not present for a specific swimmer.

Recall that the poselets for our mixture model are trained on images
depicting the swimmer from the side. We found that dense HoG templates
trained for arms are not able to distinguish between left and right
arms. As a consequence, we observe that the score of a working poselet
has two peaks within one stroke for anti-symmetrical swimming styles.
This is not a problem in general if we adjust all evaluation criteria
accordingly. Before we assess the quality of a poselet time series, we
post-process all series as follows.

Time Series Post Processing. Let $S_i = (s_{1,i}, \dots, s_{T,i})$ be
the time series describing the score of poselet $P_i$ over time. In
order to compensate for the noisy output of linear HoG filters, we
smooth each time series with a Gaussian filter. The activation of a
poselet is then given by the locations of local maxima $m_i \in M_i =
\{m_{i,t} \in \mathbb{N}^+\}$ in $S_i$, i.e. $M_i$ holds the frame
numbers where a poselet detection is locally maximal. For a good
poselet, the distance $m_{i,t+1} - m_{i,t-1}$ for $1 < t < |M_i|$ equals
the time of one complete stroke for all anti-symmetrical swimming
styles. For breaststroke and butterfly, this time period is equivalently
given by $m_{i,t+1} - m_{i,t}$, as a poselet only has one score peak
within a stroke. We finally build a histogram for all stroke intervals
within a sliding window of $S_i$ in order to determine the main stroke
frequency $f_{stroke}$ for the swimmer and iteratively discard obvious
false detections in $S_i$ by greedily deleting occurrences in $M_i$ that
produce frequencies much smaller than $f_{stroke}$. An interval
$[m_{i,t-1}, m_{i,t+1}]$ is called regular iff $|m_{i,t+1} - m_{i,t-1} -
f_{stroke}| < \Delta$ holds for a small $\Delta$ (e.g. $\Delta = 0.1
\cdot f_{stroke}$).

Finally, we sort all poselet activation series by their “goodness”: By
computing the error between adjacent groups of activations within a
poselet series, we get a small error if the series is very regular (i.e.
it is well suited for the prediction step and therefore a good series)
and a larger error if the series is irregular due to additional, missed
or false detections. Such irregular series will introduce errors in our
key-pose predictions and are therefore ill-suited.
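
The post-processing steps can be sketched as below. The sketch makes simplifying assumptions that are not part of the paper: the stroke length is taken as the median of all candidate intervals instead of a sliding-window histogram, the greedy removal of false detections is omitted, and all function names and the smoothing width are hypothetical.

```python
import numpy as np

def gaussian_smooth(series, sigma=2.0):
    """Smooth a score series with a truncated, normalized Gaussian kernel."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(np.asarray(series, float), radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def local_maxima(series):
    """Frame indices of interior local maxima (the poselet activations)."""
    s = np.asarray(series)
    inner = s[1:-1]
    return np.where((inner > s[:-2]) & (inner >= s[2:]))[0] + 1

def estimate_stroke_length(maxima, step=2):
    """Median distance between activations `step` apart (step=2 for
    anti-symmetrical styles, where a poselet peaks twice per stroke)."""
    m = np.asarray(maxima)
    return float(np.median(m[step:] - m[:-step]))

def regular_intervals(maxima, stroke_len, rel_tol=0.1):
    """Intervals [m_{t-1}, m_{t+1}] whose length is close to one stroke."""
    m = np.asarray(maxima)
    out = []
    for t in range(1, len(m) - 1):
        if abs(m[t + 1] - m[t - 1] - stroke_len) < rel_tol * stroke_len:
            out.append((int(m[t - 1]), int(m[t + 1])))
    return out
```

For a breaststroke or butterfly series, where a poselet peaks only once per stroke, the same idea applies with adjacent activations instead of every second one.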

Key-pose prediction. The final step in our framework tries to find the
best estimate $\hat{g}$ of an occurrence of a key-pose g given that we
observe the activations $m_i$ of n spatially constrained poselets, i.e.

$\hat{g} = \arg\max_{g} p(g \mid m_1, \dots, m_n)$.    (8)

We can rewrite this MAP hypothesis by means of Bayes’ theorem,
presupposing independence between all poselets, yielding

$\hat{g} = \arg\max_{g} \sum_{i}^{n} \log(p(m_i \mid g)) + \log(p(g))$.    (9)

To begin with, we assume a uniformly distributed prior and only look
into the maximum likelihood of equation 9. Intuitively, each poselet
gives its own estimate of where a key-pose occurs in the time series,
modeled by the likelihood $p(m_i \mid g)$ of poselet i. The final
position is then “averaged” between different estimates. The likelihoods
in equation 9 can be modeled by applying our pictorial structure of
poselets on a set of videos where the ground truth frame numbers for a
key-pose were annotated by a human expert. The resulting time series for
all videos are post-processed as described above. The likelihood for
poselet i is then modeled by a normal distribution for ground truths
relative to poselet activations. We therefore collect regular activation
intervals $[m_{i,t-1}, m_{i,t+1}]$. The anti-symmetry property of
freestyle swimmers yields two ground truth occurrences $g_{t-1}, g_t \in
\mathbb{N}^+$ (i.e. one for the left arm, one for the right arm) in
between them. The likelihood $p(m_i \mid g)$ is modeled by a Gaussian
$\mathcal{N}(x; \mu_i, \sigma_i)$ fitted to all c, where

$c = \frac{g_t - m_{i,t-1}}{m_{i,t+1} - m_{i,t-1}}$.    (10)

The denominator normalizes the occurrence $g_t$, making c independent of
the stroke frequency. Note that a frame $g_t$ depicting a key-pose has
to be given here. As a key-pose is always defined by a human expert, all
key-poses have to be manually annotated in order to train a key-pose
prediction model.
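
As a small worked example of this training step, the following sketch computes the normalized offsets c and fits the Gaussian parameters for one poselet. It assumes that each regular interval has already been paired with exactly one annotated key-pose frame; the function name and the use of the sample standard deviation are illustrative assumptions.

```python
import numpy as np

def fit_relative_offset(intervals, key_frames):
    """Fit mean and standard deviation of the normalized position c of a
    key-pose frame inside regular activation intervals of one poselet.

    intervals:  list of (m_prev, m_next) frame numbers
    key_frames: list of annotated key-pose frames g, one per interval
    """
    c = np.array([
        (g - m_prev) / (m_next - m_prev)
        for (m_prev, m_next), g in zip(intervals, key_frames)
    ], dtype=float)
    # Sample standard deviation (ddof=1) is an assumption of this sketch.
    return c.mean(), c.std(ddof=1)

# Three strokes of 100 frames each, key-pose roughly 31% into the interval.
mu_i, sigma_i = fit_relative_offset(
    [(0, 100), (100, 200), (200, 300)], [30, 132, 231]
)
```

Because c is a fraction of the interval length, the fitted parameters can be reused for videos recorded at a different stroke frequency.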

At inference time, we directly compute the ML hypothesis, ignoring all
non-regular intervals. Each good poselet time series yields regular
intervals $[m_{i,t-1}, m_{i,t+1}]$ which generate a list of possible
occurrence estimates $\mu_{pos}$, that is

$\mu_{pos} = m_{i,t-1} + \mu_i (m_{i,t+1} - m_{i,t-1})$    (11)

and an uncertainty $\sigma_{pos}$ for each occurrence, with

$\sigma_{pos} = \sigma_i (m_{i,t+1} - m_{i,t-1})$.    (12)

Finally, the sum over all individual likelihoods is evaluated within
small subwindows around multiple guesses $\mu_{pos}$. Hence, an
occurrence k is given by

$occ_k = \arg\max_{x} \sum_{\mu_{pos} \in \text{subwindow}}
\log(\mathcal{N}(x; \mu_{pos}, \sigma_{pos}))$.    (13)

We empirically found that good placements for subwindows are given by
the locations of local maxima in $(g(\Sigma) * \sum_{pos}
\mathcal{N}(x; \mu_{pos}, \sigma_{pos}))$, with $g(\Sigma)$ being a
Gaussian smoothing kernel.
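
The inference step for a single subwindow can be sketched as follows. The sketch relies on one fact that follows directly from equation 13: a sum of Gaussian log-likelihoods in x is a quadratic function, so its maximizer is the precision-weighted mean of the individual estimates. Subwindow placement is not covered here, and the function names are hypothetical.

```python
import numpy as np

def occurrence_estimates(intervals, mu_i, sigma_i):
    """Map the trained relative offset of one poselet onto its regular
    intervals, giving frame estimates and their uncertainties."""
    lengths = np.array([hi - lo for lo, hi in intervals], dtype=float)
    starts = np.array([lo for lo, _ in intervals], dtype=float)
    mu_pos = starts + mu_i * lengths
    sigma_pos = sigma_i * lengths
    return mu_pos, sigma_pos

def fuse_estimates(mu_pos, sigma_pos):
    """Maximizer of sum_j log N(x; mu_j, sigma_j): the precision-weighted
    mean of all estimates that fall into one subwindow."""
    mu_pos = np.asarray(mu_pos, dtype=float)
    weights = 1.0 / np.asarray(sigma_pos, dtype=float) ** 2
    return float(np.sum(weights * mu_pos) / np.sum(weights))
```

This makes the “averaging” between poselets explicit: an estimate with a small uncertainty pulls the fused occurrence more strongly towards its own guess than an estimate with a large uncertainty.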

The formulation above assumes that the prior probability $p(g)$ is
uniformly distributed, thus reducing the MAP hypothesis to an ML
estimate. The complete hypothesis however assumes that we have a prior
for an initial key-pose frame. For instance, an expert could annotate
just one single occurrence of a key-pose for a specific swimmer manually
in order to improve the likelihood estimate with previous knowledge
$p(g)$. We could then propagate this single annotation to all other
cycles by building a (probably incomplete) model based on the idea
presented in Equation 10, setting the standard deviation $\sigma_{pos}$
to a fixed value. We will show the effects of ignoring or setting the
prior guess in the experimental section.




Figure 2. Four different key-poses for a freestyle swimmer: (1) upper
arm vertical under water, (2) hand leaves water, (3) upper arm vertical
above water, (4) hand touches water. Key-poses occur on the left and on
the right side of the body.


4. Experimental Results 

We validate and discuss the performance of the proposed likelihood
estimators on a set of 30 swimmer videos (720x576@50i) covering
different freestyle swimmers (6 male, 8 female, ages 15 to 25, different
body sizes) in a swimming channel with slowly increasing water flow
velocities (minimal velocity $1\,\mathrm{m\,s^{-1}}$, increments of at
most $0.3\,\mathrm{m\,s^{-1}}$, maximal velocity
$1.75\,\mathrm{m\,s^{-1}}$). The videos show the swimmers from the side.
An expert annotated all frames that depict one of four key-poses (Figure
2, overall 4 · 424 occurrences).

We trained a 16-part pictorial structure model of poselets (1 root, 15
arm poselets) from 1200 distinctive images. We do not directly evaluate
the performance of each poselet but instead show their efficiency in the
following analysis of the activation sequences. The estimation of
key-pose occurrences is evaluated in a 30-fold leave-one-out
cross-validation, where we extract ML estimators for all combinations of
29 videos and evaluate their performance on the remaining video.

Performance measures. In general, we distinguish different types of
detections: If our detectors estimate an occurrence within 10
half-frames of a ground truth annotation, the prediction is counted as a
true positive (TP) detection, and as a false positive (FP) otherwise. A
ground-truth frame without a prediction is a false negative (FN). In the
following, we will compare the stroke-length-normalized deviation of a
prediction from the ground-truth annotation (i.e. percent of deviation
from optimum, on the x-axis) with the recall of the system (y-axis),
which is defined as rec = TP/(TP + FN). The recall is an indicator of
how many key-pose occurrences we have estimated correctly. All recall
vs. deviation curves are evaluated at a deviation x = 0.03. This
threshold frames the error human annotators make when annotating the
ground truth. A deviation of 3% reflects a slack between ±1 and ±2
half-frames on average (for different videos with different water flow
velocities). For each recall vs. deviation graph, we will also compute
the precision of the model, which is defined as prec = TP/(TP + FP) and
is an indicator of how many false guesses we made.
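
The two measures can be computed with a short routine such as the one below. It is a sketch, assuming predictions and annotations are given as frame numbers and that each ground-truth frame can be matched by at most one prediction (greedy nearest-neighbour matching); this matching rule and the function name are assumptions, since the text does not specify how multiple predictions near one annotation are handled.

```python
def precision_recall(predictions, ground_truth, tol=10):
    """Precision and recall for key-pose predictions, where a prediction
    within `tol` half-frames of an unmatched annotation is a true positive."""
    gt = sorted(ground_truth)
    matched = set()
    tp = fp = 0
    for p in sorted(predictions):
        # Unmatched annotations within the tolerance, nearest first.
        candidates = [
            (abs(p - g), j) for j, g in enumerate(gt)
            if j not in matched and abs(p - g) <= tol
        ]
        if candidates:
            matched.add(min(candidates)[1])
            tp += 1
        else:
            fp += 1
    fn = len(gt) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, predictions at frames 100, 205 and 400 against annotations at 102, 200 and 300 give two true positives, one false positive and one false negative.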

ML, prior and MAP predictions. We evaluate the maximum likelihood model
described in section 3.3 for key-pose 1 as an example. Figure 3(a)
visualizes the recall relative to the deviation from ground truth
annotated frames. The model computes the goodness of each poselet
activation time series. By using the time series of the 5 best
performing poselets, it achieves a recall of 61% at a deviation of 3%
and a precision of 0.99, which means that we made only a small number of
false estimations. False positives are predicted because it is not
guaranteed that we always find 5 well-performing poselets at a time. We
furthermore evaluated a prior estimate and the complete MAP prediction
from section 3.3. A model for the prior can be extracted from just a
single expert annotation for each key-pose in the same manner as the ML
model. This model might be incomplete though, as we cannot find regular
stroke intervals for all poselets framing this one ground truth
annotation. Also, we are not able to estimate an uncertainty
$\sigma_{pos}$ from just one example, although we can assume that the
annotation is probably very good for this swimmer and set a fixed small
value (empirically: $\sigma_{pos} = 0.04 \cdot f_{stroke}$). We can
apply this model alone or join it with the maximum likelihood to
complete the MAP estimator. While the prior alone performs
unsurprisingly well for all videos and a little better than the ML, it
is slightly surpassed by the outcome of the complete MAP estimate
(Figure 3(a)).

Performance improvements. In order to enhance the performance of our
system, we successively added additional “bad” poselet time series to
our MAP estimates (Figure 3(b)). While the recall clearly improves, up
to a rate of 0.85 if we use all poselets (even disadvantageous ones)
from the model, the precision drops to 0.80 as an unwanted side effect.
We found that most poselet time series, even the bad ones, contain at
least some valid stroke intervals which improve the performance.
However, they of course also contain a lot of regular stroke intervals
which do not fit the regularity of the complete signal; these intervals
produce a lot of false predictions. Surprisingly, while the ML estimator
always performs a little worse than the MAP estimator, it keeps its good
precision longer (see precision comparison for green graphs). We found
that the precision can be improved by applying the same heuristics used
for post-processing single activation series in section 3.3 to the
key-pose estimation series. Additionally, we condition the sliding
window approach for averaging between different single poselet
predictions (Equation 13) so that a minimum of two predictions have to
be within the subwindow. Both heuristics effectively remove nearly all
remaining false positives in any key-pose occurrence series, leaving us
with an acceptable precision of prec > 0.98.

Figure 3. Recall of our system for different slacks (x-axis: deviation
from ground truth annotation). (a) Comparison of different estimation
methods (ML, prior, MAP) for keypose 1. (b) Recall for predictions from
different numbers of support poses. (c) Performance comparison for
different key-poses. (d) Recall for our system on the CMU MoBo dataset
[12] for slow walking.

Additional camera view. The insight that estimates improve
with additional, ideally good, poselet time series
inspired the following experiment: We trained an additional
7-part mixture of poselets for a second camera view
(Figure 1(b)) that captures each swimmer from above the
swimming channel. While this camera does not monitor
any movement below the water line because of the turbulent
water surface, it detects the swimmer and any arm movement
above the water line very well. Note that this model
behaves like a model trained for a symmetrical swimming
style: poselet activations only occur once per cycle (instead
of twice for an anti-symmetrical swimmer observed
from the side) because each arm is detected by its own set
of poselets. While interpolating key-poses for this type of
model is a bit easier (time series of anti-symmetrical styles
can be interpreted as two independent, superimposed event
signals, one for each body side), we want to join their time
series with our side-view results. Therefore, we condense
pairs of poselet time series from the above view, so that
the same semantic postures of the arms on both sides of the
swimmer form a new series. For the 7-part model, we hence
get an additional 3 time series, which we join with the other
15 time series which we extracted from the first view. As
a result the recall improves by another 4.5% to an overall value of
0.89, with no significant change in precision (a drop of
0.01). This final prediction result is depicted in Figure 3(b)
(black graph).
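The condensing step might look as follows. This is a sketch under assumptions: the text does not state how two series are merged, so an element-wise maximum of the per-frame activation scores is used here as one plausible choice, and the poselet names, the function `condense_pairs` and the data layout are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def condense_pairs(series: Dict[str, Sequence[float]],
                   pairs: List[Tuple[str, str]]) -> List[List[float]]:
    """Merge left/right poselet activation series that show the same arm posture.

    series: per-frame activation scores, keyed by a (hypothetical) poselet name.
    pairs: names of the two series that depict the same semantic posture.
    """
    merged: List[List[float]] = []
    for left, right in pairs:
        a, b = series[left], series[right]
        if len(a) != len(b):
            raise ValueError("paired series must cover the same frames")
        # Assumption: keep the stronger activation of the two body sides per frame.
        merged.append([max(x, y) for x, y in zip(a, b)])
    return merged


# Toy example: one pair of top-view series joined with two side-view series.
top_view = {
    "left_arm_entry": [0.0, 0.9, 0.0, 0.0],
    "right_arm_entry": [0.0, 0.0, 0.0, 0.8],
}
side_view = [[0.1, 0.0, 0.7, 0.0], [0.0, 0.6, 0.0, 0.2]]
joined = side_view + condense_pairs(top_view, [("left_arm_entry", "right_arm_entry")])
print(len(joined))
```

The merged series fires once per arm, that is, twice per cycle, which matches the rhythm of the side-view series for an anti-symmetrical style and allows both sets to be fed into the same estimator.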

Comparison between different key-poses. Our approach
is of course not limited to only one key-pose for a
cyclic motion. In fact, if we would like to compute
inner-cyclic intervals, we need more than one measuring point
within a stroke. Therefore, an expert annotated four different
key-poses (depicted in Figure 2) and we trained one
combined poselet+MAP model for each pose. Figure
3(c) shows the deviation from ground truth frames for these
four key-poses. We observe that the predictions for some
key-poses are better than for others. This behavior can be
explained by the fact that cyclic motion is of course not linear
(or constant) in its acceleration and velocity. We found that
our model has some difficulties in precisely detecting
key-poses in intervals where the velocity of an arm is very small
over a timespan of 10 to 20 frames. Additionally, we found
that annotating the ground truth for worse performing
estimators was always more difficult due to heavy occlusion
and image noise.

Human Gait Dataset. Although we originally developed our
models for swimmers in swimming channels, they are
not bound to this application. In order to show that our
approach can be used for any regular cyclic motion, we carried
out another experiment on the CMU Motion of Body
database [12], which is one of the better known human gait
datasets. We trained a 10-part pictorial structure for leg
configurations of 10 slowly walking persons depicted from the
left side (Figure 1(c)). The poselet model is completed with
an ML estimator which was trained for the key-pose where
either the left or the right heel of the test person touches
the ground again (end of swing phase, beginning of stance
phase). The model was then evaluated on a different set
of 10 videos depicting slowly walking test persons with the
same evaluation criterion as for the swimmers. The result is
depicted in Figure 3(d). As the “walking frequency” is usually
higher than the stroke frequency of a swimmer, an acceptable
deviation of ±2 frames is equivalent to 8%. Within
this range, 80% of the “tap-events” are classified correctly with a
precision of 0.96.
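A slack-based evaluation of this kind can be sketched as follows. The matching rule is an assumption (greedy one-to-one assignment of each annotation to its closest unused prediction), as are the function names and the toy frame numbers; the exact criterion used for Figure 3 may differ.

```python
from typing import List, Tuple


def match_events(gt: List[int], pred: List[int], slack: int) -> int:
    """Count annotations matched one-to-one by a prediction within +/- slack frames."""
    unused = sorted(pred)
    hits = 0
    for g in sorted(gt):
        # Closest prediction that has not been assigned to an annotation yet.
        best = min(unused, key=lambda p: abs(p - g), default=None)
        if best is not None and abs(best - g) <= slack:
            unused.remove(best)
            hits += 1
    return hits


def recall_precision(gt: List[int], pred: List[int], slack: int) -> Tuple[float, float]:
    """Recall and precision of predicted key-pose frames at a given slack."""
    hits = match_events(gt, pred, slack)
    recall = hits / len(gt) if gt else 0.0
    precision = hits / len(pred) if pred else 0.0
    return recall, precision


# Toy data: four annotated heel strikes, four predictions (one spurious).
annotated = [10, 35, 60, 85]
predicted = [11, 38, 60, 120]
for slack in (0, 1, 2, 3):
    print(slack, recall_precision(annotated, predicted, slack))
```

Sweeping `slack` over a range of frame deviations yields a recall curve of the kind shown in Figure 3; a larger slack can only keep or increase the number of matched events for a fixed set of predictions in this toy setup.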

Comparison to other approaches. Comparing our approach
with other systems is difficult: to our knowledge, the
detection of key-poses in repetitive motion of athletes with
the intention of retrieving the stroke frequency and
inner-cyclic intervals has not been researched directly. The only
comparable approach [23] trains complete object detectors
on segments of a swimming cycle. The stroke rate is then
extracted by counting the occurrences of an interval in a
video. However, they show that their system is not able to
detect arbitrary poses.

5. Conclusion 

We presented a system for estimating the occurrences
of key-poses of top-level swimmers in a swimming channel.
We showed that while it might be difficult to detect a
key-pose feature directly, we can estimate the occurrence
of such a pose with high reliability. Future work will focus
on extending the approach to other swimming styles and
further sports. An additional interesting question is how our
models can be transferred to work on arbitrary (competitive)
swimmers in swimming pools in general.